Informative spectro-temporal bottleneck features for noise-robust speech recognition
Abstract
Spectro-temporal Gabor features based on auditory knowledge have improved word accuracy for automatic speech recognition in the presence of noise. In previous work, we generated robust spectro-temporal features that incorporated the power normalized cepstral coefficient (PNCC) algorithm. The corresponding power normalized spectrum (PNS) is then processed by many Gabor filters, yielding a high-dimensional feature vector. In tandem processing, an MLP with one hidden layer is often employed to learn discriminative transformations from front-end features, in this case Gabor-filtered power spectra, to probabilistic features; we refer to the resulting system as PNS-Gabor MLP. Here we improve PNS-Gabor MLP in two ways. First, we select informative Gabor features using sparse principal component analysis (sparse PCA) before tandem processing. Second, we use a deep neural network (DNN) with a bottleneck structure. Experiments show that the high-dimensional Gabor features are redundant. In our experiments, sparse PCA suggests that Gabor filters with longer time scales are particularly informative. The best of our experimental modifications gave an error rate reduction of 15.5% relative to PNS-Gabor MLP plus MFCC, and of 41.4% relative to an MFCC baseline, on a large-vocabulary continuous speech recognition task using noisy data.
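The Gabor filtering step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the filter sizes, the Hann envelope, the modulation frequencies, and the random array standing in for a power normalized spectrum are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def gabor_filter(omega_k, omega_n, size_k=9, size_n=21):
    """Real 2D Gabor-like filter: cosine carrier times a Hann envelope.

    omega_k: spectral modulation frequency (radians per channel), assumed value
    omega_n: temporal modulation frequency (radians per frame), assumed value
    """
    k = np.arange(size_k) - size_k // 2
    n = np.arange(size_n) - size_n // 2
    K, N = np.meshgrid(k, n, indexing="ij")
    envelope = np.outer(np.hanning(size_k), np.hanning(size_n))
    carrier = np.cos(omega_k * K + omega_n * N)
    return envelope * carrier


def filter_spectrogram(spec, g):
    """2D correlation of a (channels, frames) array with filter g, same-size output."""
    pk, pn = g.shape[0] // 2, g.shape[1] // 2
    padded = np.pad(spec, ((pk, pk), (pn, pn)), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, g.shape)
    # Sum over the filter window at every (channel, frame) position.
    return np.einsum("ijkl,kl->ij", windows, g)


def gabor_features(spec, params):
    """Stack the outputs of several filters into one feature vector per frame."""
    outputs = [filter_spectrogram(spec, gabor_filter(wk, wn)) for wk, wn in params]
    return np.concatenate(outputs, axis=0)


# Random stand-in for a spectrogram with 40 channels and 100 frames.
rng = np.random.default_rng(0)
spec = rng.random((40, 100))

# Three hypothetical (spectral, temporal) modulation frequency pairs.
params = [(0.0, 0.2), (0.5, 0.0), (0.5, 0.2)]
features = gabor_features(spec, params)
print(features.shape)  # (num_filters * channels, frames)
```

Even three filters triple the per-frame dimensionality in this sketch, which gives a sense of why a large filter bank yields a high-dimensional, potentially redundant feature vector that benefits from a selection step such as sparse PCA.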
Similar resources
An Information-Theoretic Discussion of Convolutional Bottleneck Features for Robust Speech Recognition
Convolutional Neural Networks (CNNs) have shown strong performance in speech recognition systems, both for feature extraction and for acoustic modeling. In addition, CNNs have been used for robust speech recognition, and competitive results have been reported. A Convolutive Bottleneck Network (CBN) is a kind of CNN that has a bottleneck layer among its fully connected layers. The bottleneck fea...
Spectro-temporal modulations for robust speech emotion recognition
Speech emotion recognition is mostly considered in clean speech. In this paper, joint spectro-temporal features (RS features) are extracted from an auditory model and are applied to detect the emotion status of noisy speech. The noisy speech is derived from the Berlin Emotional Speech database with added white and babble noises under various SNR levels. The clean train/noisy test scenario is in...
Methods for capturing spectro-temporal modulations in automatic speech recognition
Psychoacoustical and neurophysiological results indicate that spectro-temporal modulations play an important role in sound perception. Speech signals, in particular, exhibit distinct spectro-temporal patterns which are well matched by receptive fields of cortical neurons. In order to improve the performance of automatic speech recognition (ASR) systems a number of different approaches are prese...
Spectro-temporal directional derivative features for automatic speech recognition
We introduce a novel spectro-temporal representation of speech by applying directional derivative filters to the mel-spectrogram, with the aim of improving the robustness of automatic speech recognition. Previous studies have shown that two-dimensional wavelet functions, when tuned to appropriate spectral scales and temporal rates, are able to accurately capture the acoustic modulations of speec...
Phoneme Classification Using Temporal Tracking of Speech Clusters in Spectro-temporal Domain
This article presents a new feature extraction technique based on the temporal tracking of clusters in the spectro-temporal feature space. In the proposed method, auditory cortical outputs were clustered. The attributes of speech clusters were extracted as secondary features. However, the shape and position of speech clusters change over time. The clusters were temporally tracked and temporal tra...